Supplementary Material: Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline

Jia, Qi, Fan, Baoyu, Xu, Cong, Liu, Lu

Neural Information Processing Systems

This section provides a comprehensive overview of the CSMV dataset. The dataset spans more than two years, an extensive time range that allows for the inclusion of a diverse set of content and captures the evolution of sentiments over that period. The distribution of labels in the CSMV dataset is shown in Figure 1. In Figure 1a, the opinion labels are distributed as follows: positive - 47%, neutral - 42%, and negative - 11%. Negative comments are clearly in the minority.
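
For readers who want to reproduce this kind of summary, here is a minimal sketch of tallying an opinion-label distribution. The label counts below are hypothetical placeholders chosen to match the reported percentages, not the actual CSMV statistics.

# Minimal sketch: computing an opinion-label distribution like the one reported
# for CSMV (positive / neutral / negative). Counts are hypothetical placeholders.
from collections import Counter

labels = ["positive"] * 47 + ["neutral"] * 42 + ["negative"] * 11

counts = Counter(labels)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count / total:.0%}")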





Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Mao, Mingyang, Perez-Cabarcas, Mariela M., Kallakuri, Utteja, Waytowich, Nicholas R., Lin, Xiaomin, Mohsenin, Tinoosh

arXiv.org Artificial Intelligence

To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden from humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.
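
To make the idea of multi-source retrieval-augmented generation concrete, here is an illustrative sketch, not the authors' implementation: per-modality observations are reduced to text snippets, scored against a query with a toy bag-of-words similarity, and the top matches are packed into a prompt for a downstream language model. The index contents, the similarity function, and all names are assumptions for illustration.

# Toy multi-source RAG step: retrieve the most relevant per-modality snippets
# for a query and assemble them into a prompt. A real system would use learned
# encoders and an actual LLM; this only shows the data flow.
import math
from collections import Counter

def bow_vector(text):
    # Toy bag-of-words "embedding".
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical multi-source index: each entry is a text snippet derived from one
# modality (video caption, audio transcript, on-screen text).
index = [
    {"modality": "video", "text": "a person places a kettle on the stove"},
    {"modality": "audio", "text": "the kettle starts whistling loudly"},
    {"modality": "text",  "text": "recipe step two: boil water for the tea"},
]

def retrieve(query, k=2):
    q = bow_vector(query)
    return sorted(index, key=lambda e: cosine(q, bow_vector(e["text"])), reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(f"[{s['modality']}] {s['text']}" for s in retrieve(query))
    # The assembled prompt would be passed to an LLM for the final answer.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("what is the person boiling water for?"))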


Laugh, Relate, Engage: Stylized Comment Generation for Short Videos

Ouyang, Xuan, Wang, Senan, Wang, Bouzhou, Xiahou, Siyuan, Zhou, Jinrong, Li, Yuekang

arXiv.org Artificial Intelligence

Short-video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re-creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi-agent system (MAS) designed for controllable short-video comment generation. The system integrates video segmentation, contextual and affective analysis, and style-aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine-grained style control through explicit prompt markers and few-shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics (originality, relevance, and style conformity) with a large-scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short-video platforms, offering a promising path to enhance user engagement and creative interaction.
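
The abstract mentions fine-grained style control through explicit prompt markers and few-shot examples; the sketch below illustrates that pattern in its simplest form. The style names, markers, and example comments are hypothetical stand-ins, not LOLGORITHM's actual prompts.

# Illustrative style-aware prompt construction: an explicit style marker plus a
# few-shot example steers generation toward one of the supported comment styles.
FEW_SHOT = {
    "pun": "Video: a cat knocks over a glass. Comment: That cat really raised the bar... then dropped it.",
    "sarcasm": "Video: a failed backflip. Comment: Truly the most graceful landing I've ever seen.",
    "plain_humor": "Video: a dog steals a sandwich. Comment: Fastest lunch delivery in town.",
}

def build_styled_prompt(video_summary: str, style: str) -> str:
    if style not in FEW_SHOT:
        raise ValueError(f"unsupported style: {style}")
    return (
        f"<style={style}>\n"            # explicit style marker
        f"Example: {FEW_SHOT[style]}\n"  # one few-shot example
        f"Video: {video_summary}\n"
        f"Comment:"
    )

# The resulting prompt would be sent to a multimodal LLM together with the video.
print(build_styled_prompt("a parrot imitates a phone ringtone", "sarcasm"))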


AVA: Towards Agentic Video Analytics with Vision Language Models

Yan, Yuxuan, Jiang, Shiqi, Cao, Ting, Yang, Yifan, Yang, Qianqian, Shu, Yuanchao, Yang, Yuqing, Qiu, Lili

arXiv.org Artificial Intelligence

AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
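
The core indexing idea is that events extracted from a long stream are stored in a graph-like structure so that queries can jump to related events instead of rescanning the whole video. The sketch below is a toy stand-in for that idea under stated assumptions: the event records, the entity-based index, and the matching rule are illustrative, not AVA's actual EKG format or retrieval logic.

# Toy event index for a long video: events are keyed by the entities they
# mention, so a query retrieves related events without scanning the stream.
from collections import defaultdict

events = [
    {"t": 120.0, "entities": {"person", "car"}, "desc": "a person unlocks a car"},
    {"t": 340.5, "entities": {"car"}, "desc": "the car leaves the parking lot"},
    {"t": 900.0, "entities": {"person", "dog"}, "desc": "a person walks a dog"},
]

# Entity -> list of event indices: a minimal stand-in for an event knowledge graph.
entity_index = defaultdict(list)
for i, ev in enumerate(events):
    for ent in ev["entities"]:
        entity_index[ent].append(i)

def query_events(entity: str):
    """Return all events mentioning the entity, ordered by timestamp."""
    hits = [events[i] for i in entity_index.get(entity, [])]
    return sorted(hits, key=lambda e: e["t"])

for ev in query_events("car"):
    print(f"{ev['t']:>7.1f}s  {ev['desc']}")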

